Evaluating Emerging CXL-enabled Memory Pooling for HPC Systems
Current HPC systems provide memory resources that are statically configured
and tightly coupled with compute nodes. However, workloads on HPC systems are
evolving. Diverse workloads lead to a need for configurable memory resources to
achieve high performance and utilization. In this study, we evaluate a memory
subsystem design leveraging CXL-enabled memory pooling. Two promising use cases
of composable memory subsystems are studied -- fine-grained capacity
provisioning and scalable bandwidth provisioning. We developed an emulator to
explore the performance impact of various memory compositions. We also provide
a profiler to identify the memory usage patterns in applications and their
optimization opportunities. Seven scientific and six graph applications are
evaluated on various emulated memory configurations. Three out of seven
scientific applications had less than 10% performance impact when the pooled
memory backed 75% of their memory footprint. The results also show that a
dynamically configured high-bandwidth system can effectively support
bandwidth-intensive unstructured mesh-based applications like OpenFOAM.
Finally, we identify interference through shared memory pools as a practical
challenge for adoption on HPC systems. Comment: 10 pages, 13 figures. Accepted for publication in the Workshop on Memory
Centric High Performance Computing (MCHPC'22) at SC22.
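The capacity-provisioning result above (under 10% impact with 75% of the footprint on pooled memory) can be illustrated with a first-order latency model. This is a minimal sketch under assumed, illustrative latencies; the function name and the numbers are not from the paper:

```python
# Hypothetical first-order model: average memory access latency when a
# fraction of an application's footprint is backed by slower CXL-attached
# pooled memory. The 100 ns / 250 ns figures are illustrative assumptions.

def blended_latency(local_ns, pooled_ns, pooled_fraction):
    """Access-weighted average latency across local and pooled memory."""
    return (1.0 - pooled_fraction) * local_ns + pooled_fraction * pooled_ns

# Example composition: local DRAM ~100 ns, pooled memory ~250 ns,
# 75% of the footprint backed by the pool.
avg = blended_latency(100.0, 250.0, 0.75)  # 212.5 ns on average
```

Such a model ignores caching and access locality, which is precisely why the paper's emulator and profiler are needed to measure real application impact.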
tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads
Machine Learning applications on HPC systems have been gaining popularity in
recent years. The upcoming large scale systems will offer tremendous
parallelism for training through GPUs. However, I/O is another heavy component of Machine
Learning workloads, and it can become a performance bottleneck.
TensorFlow, one of the most popular Deep-Learning platforms, now offers a new
profiler interface and allows instrumentation of TensorFlow operations.
However, the current profiler only enables analysis at the TensorFlow platform
level and does not provide system-level information. In this paper, we extend
TensorFlow Profiler and introduce tf-Darshan, both a profiler and tracer, that
performs instrumentation through Darshan. We use the same Darshan shared
instrumentation library and implement a runtime attachment without using a
system preload. We can extract Darshan profiling data structures during
TensorFlow execution to enable analysis through the TensorFlow profiler. We
visualize the performance results through TensorBoard, the web-based TensorFlow
visualization tool. At the same time, we do not alter Darshan's existing
implementation. We illustrate tf-Darshan by performing two case studies on
ImageNet image and Malware classification. We show that by guiding optimization
using data from tf-Darshan, we increase POSIX I/O bandwidth by up to 19% by
selecting data for staging on fast tier storage. We also show that Darshan has
the potential of being used as a runtime library for profiling and providing
information for future optimization. Comment: Accepted for publication at the 2020 International Conference on
Cluster Computing (CLUSTER 2020).
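The runtime-attachment idea described above (loading the shared instrumentation library during execution rather than via a system preload) can be sketched with `ctypes`. The library path and the exported symbol name here are illustrative assumptions, not the actual Darshan API:

```python
import ctypes

# Hypothetical sketch of runtime attachment: open an instrumentation shared
# library while the process is running and read a counter through an exported
# symbol, instead of relying on LD_PRELOAD at process startup.
# "darshan_counter_snapshot" is an assumed symbol name for illustration.

def attach_and_read(lib_path):
    try:
        lib = ctypes.CDLL(lib_path)  # dlopen() the library at runtime
    except OSError:
        return None  # library not available on this system
    snapshot = getattr(lib, "darshan_counter_snapshot", None)
    if snapshot is None:
        return None  # symbol not exported by this library
    snapshot.restype = ctypes.c_long
    return snapshot()
```

In the real tool, the extracted Darshan records are translated into TensorFlow profiler events so they appear alongside TensorFlow operations in TensorBoard.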
Leveraging HPC Profiling & Tracing Tools to Understand the Performance of Particle-in-Cell Monte Carlo Simulations
Large-scale plasma simulations are critical for designing and developing
next-generation fusion energy devices and modeling industrial plasmas. BIT1 is
a massively parallel Particle-in-Cell code designed specifically for studying
plasma-material interaction in fusion devices. Its most salient characteristic
is the inclusion of collision Monte Carlo models for different plasma species.
In this work, we characterize the single-node, multi-node, and I/O performance
of the BIT1 code in two realistic cases using several HPC profilers, such as
perf, IPM, Extrae/Paraver, and Darshan. We find that the on-node performance
of the BIT1 sorting function is the main bottleneck. Strong scaling
tests show a parallel performance of 77% and 96% on 2,560 MPI ranks for the two
test cases. We demonstrate that communication, load imbalance, and
self-synchronization are important factors impacting the performance of
BIT1 in large-scale runs. Comment: Accepted by the Euro-Par 2023 workshops (TDLPP 2023); prepared in the
standardized Springer LNCS format, the paper consists of 12 pages, including
the main text, references, and figures.
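The strong-scaling figures quoted above follow from the usual parallel-efficiency definition. A small helper, with illustrative timings chosen only to show the shape of the calculation (not measured BIT1 data):

```python
# Strong-scaling efficiency relative to a baseline run: the ratio of actual
# speedup to ideal speedup. Timings and rank counts below are illustrative
# assumptions, not measurements from the paper.

def strong_scaling_efficiency(t_base, ranks_base, t_scaled, ranks_scaled):
    """(t_base * ranks_base) / (t_scaled * ranks_scaled), i.e. speedup / ideal."""
    return (t_base * ranks_base) / (t_scaled * ranks_scaled)

# Hypothetical example: 1000 s on 64 ranks vs 32.5 s on 2,560 ranks
eff = strong_scaling_efficiency(1000.0, 64, 32.5, 2560)  # ~0.77, i.e. 77%
```

An efficiency near 1.0 means runtime shrinks in proportion to the added ranks; values well below 1.0 point to the communication, load-imbalance, and self-synchronization costs the abstract identifies.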
sputniPIC: an Implicit Particle-in-Cell Code for Multi-GPU Systems
Large-scale simulations of plasmas are essential for advancing our
understanding of fusion devices, space, and astrophysical systems.
Particle-in-Cell (PIC) codes have demonstrated their success in simulating
numerous plasma phenomena on HPC systems. Today, flagship supercomputers
feature multiple GPUs per compute node to achieve unprecedented computing power
at high power efficiency. PIC codes require new algorithm design and
implementation for exploiting such accelerated platforms. In this work, we
design and optimize a three-dimensional implicit PIC code, called sputniPIC, to
run on a general multi-GPU compute node. We introduce a particle decomposition
data layout, in contrast to domain decomposition on CPU-based implementations,
to use particle batches for overlapping communication and computation on GPUs.
sputniPIC also natively supports different precision representations to
achieve speedup on hardware that supports reduced precision. We validate sputniPIC
through the well-known GEM challenge and provide performance analysis. We test
sputniPIC on three multi-GPU platforms and report a 200-800x performance
improvement over the CPU OpenMP version of sputniPIC. We
show that reduced precision could further improve performance by 45% to 80% on
the three platforms. Because of these performance improvements, sputniPIC
enables, on a single node with multiple GPUs, large-scale three-dimensional
PIC simulations that were previously possible only on clusters. Comment: Accepted for publication at the 32nd International Symposium on
Computer Architecture and High Performance Computing (SBAC-PAD 2020).
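The particle-batch layout described above can be sketched in plain Python. On a real GPU runtime the point of batching is that the transfer of batch i+1 can overlap with the computation on batch i; here the loop is serialized for clarity, and the toy "mover" is an assumption for illustration, not the sputniPIC mover:

```python
# Hypothetical sketch of particle batching: split the particle array into
# fixed-size batches so that, on a GPU, host-device transfers of one batch
# can overlap with computation on the previous one (serialized here).

def batches(particles, batch_size):
    """Yield consecutive slices of the particle list."""
    for i in range(0, len(particles), batch_size):
        yield particles[i:i + batch_size]

def push(batch, dt):
    """Placeholder particle mover: advance position by velocity * dt."""
    return [(x + v * dt, v) for (x, v) in batch]

# Toy particles as (position, velocity) pairs.
particles = [(0.0, 1.0), (1.0, 2.0), (2.0, 0.5), (3.0, 1.5)]
updated = [p for b in batches(particles, 2) for p in push(b, 0.1)]
```

In contrast to a domain decomposition, every batch here carries particles from anywhere in the domain, which is what lets the GPU pipeline stay busy regardless of where particles cluster spatially.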
FAST-ASKAP Synergy: Quantifying Coexistent Tidal and Ram-Pressure Strippings in the NGC 4636 Group
Combining new HI data from a synergetic survey of ASKAP WALLABY and FAST with
the ALFALFA data, we study the effect of ram-pressure and tidal interactions in
the NGC 4636 group. We develop two parameters to quantify and disentangle these
two effects on gas stripping in HI-bearing galaxies: the strength of external
forces at the optical-disk edge, and the outside-in extents of HI-disk
stripping. We find that gas stripping is widespread in this group, affecting
80% of HI-detected non-merging galaxies, and that 34% are experiencing both
types of stripping. Among the galaxies experiencing both effects, the strengths
(and extents) of ram-pressure and tidal stripping are independent of each
other. Both strengths are correlated with HI-disk shrinkage. The tidal strength
is related to a rather uniform reddening of low-mass galaxies
() when tidal stripping is the dominating effect. In
contrast, ram pressure is not clearly linked to the color-changing patterns of
galaxies in the group. Combining these two stripping extents, we estimate the
total stripping extent, and put forward an empirical model that can describe
the decrease of HI richness as galaxies fall toward the group center. The
stripping timescale we derived decreases with distance to the center, from
around to
near the center. Gas-depletion happens
since crossing for HI-rich galaxies,
but much quicker for HI-poor ones. Our results quantify in a physically
motivated way the details and processes of environmental-effects-driven galaxy
evolution, and might assist in analyzing hydrodynamic simulations in an
observational way. Comment: 44 pages, 22 figures, 5 tables, accepted for publication in ApJ.
Tables 4 and 5 are also available in machine-readable format.
Hypoplastic Left Heart Syndrome Current Considerations and Expectations
In the recent era, no congenital heart defect has undergone a more dramatic change in diagnostic approach, management, and outcomes than hypoplastic left heart syndrome (HLHS). During this time, survival to the age of 5 years (including Fontan) has ranged from 50% to 69%, but current expectations are that 70% of newborns born today with HLHS may reach adulthood. Although the 3-stage treatment approach to HLHS is now well founded, there is significant variation among centers. In this white paper, we present the current state of the art in our understanding and treatment of HLHS during the stages of care: 1) pre-Stage I: fetal and neonatal assessment and management; 2) Stage I: perioperative care, interstage monitoring, and management strategies; 3) Stage II: surgeries; 4) Stage III: Fontan surgery; and 5) long-term follow-up. Issues surrounding the genetics of HLHS, developmental outcomes, and quality of life are addressed in addition to the many other considerations for caring for this group of complex patients.